import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
df = pd.read_excel(r'C:\Users\admin\Desktop\KANAV BANSAL\KANAV TASKS\TASK - II - EDA - AMCAT DATA/reports.xlsx')
df
The dataset holds 3998 records. Each record describes one person: gender, designation, salary, location, 10th, 12th and B.Tech percentages, college tier, specialization, and so on.
df.describe(include='all')
This gives the count, mean, standard deviation, minimum, maximum and quartiles (the 50% row is the median) of every numeric column, plus the unique, top and freq values of the non-numeric columns.
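As a minimal sketch of what this summary contains, the snippet below runs `describe(include="all")` on a small made-up frame. The values are hypothetical; only the column names `Gender` and `Salary` are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the AMCAT sheet.
toy = pd.DataFrame({
    "Gender": ["m", "f", "m", "f", "m"],
    "Salary": [300000, 450000, 500000, 250000, 400000],
})

summary = toy.describe(include="all")

# Numeric columns fill count/mean/std/min/quartiles/max;
# text columns fill count/unique/top/freq, and the rest is NaN.
median_from_describe = summary.loc["50%", "Salary"]
most_common_gender = summary.loc["top", "Gender"]

print(summary)
print(median_from_describe == toy["Salary"].median())
```

The `50%` row is the median, so there is no separate row labelled "median" in the output.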
X = df.head()
X
The first five rows contain 3 males and 2 females. The highest salary among them belongs to one of the males, a SENIOR SOFTWARE ENGINEER.
Y = df.tail()
Y
The last five rows contain 3 females and 2 males. The highest salary among them belongs to one of the females, a SENIOR SYSTEMS ENGINEER.
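Observations like these can be computed instead of read off the table by eye. The sketch below uses a hypothetical five-row frame standing in for `df.head()`; the values are invented for illustration and are not the real rows.

```python
import pandas as pd

# Hypothetical five-row slice standing in for df.head(); values are made up.
first_rows = pd.DataFrame({
    "Gender": ["f", "m", "f", "m", "m"],
    "Designation": ["analyst", "senior software engineer", "developer",
                    "tester", "developer"],
    "Salary": [420000, 1100000, 320000, 200000, 500000],
})

# How many rows fall in each gender.
gender_counts = first_rows["Gender"].value_counts()

# The full row of the person with the highest salary.
top_earner = first_rows.loc[first_rows["Salary"].idxmax()]

print(gender_counts)
print(top_earner[["Gender", "Designation", "Salary"]])
```

Swapping `first_rows` for `X` or `Y` would apply the same two calls to the real head and tail of the data.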
# distplot is deprecated in recent seaborn releases; displot covers the same cases.
sns.displot(df['Salary'], kde = True)
sns.displot(df['Salary'], rug = True)
sns.displot(df['Salary'])
sns.displot(df['collegeGPA'], rug = True)
sns.displot(df['Logical'], rug = True)
`displot` is the figure-level replacement for the deprecated `distplot`, with similar flexibility over the kind of plot to draw.
These plots describe a univariate set of observations and visualize it through a histogram: only one variable is involved, so each call picks one particular column of the dataset.
The same kind of plot shows a simple bell-shaped distribution when it is fed random values created with random.randn().
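A minimal sketch of the random-values case mentioned above: it draws standard normal values and bins them the way a histogram does. The seed, the sample size and the choice of 20 bins are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the sketch is repeatable
values = rng.standard_normal(1000)   # same distribution np.random.randn draws from

# Bin the values the way a histogram plot would: 20 equal-width bins.
counts, edges = np.histogram(values, bins=20)

print(counts)               # number of observations per bin
print(edges[0], edges[-1])  # smallest and largest bin edge
```

Passing `values` to a histogram plot draws these counts as bars, although the plotting function may pick its own default number of bins.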
sns.catplot(x = "Gender", y = "Salary", hue = "CollegeTier", kind = "point", data = df)
sns.catplot(x = "Gender", y = "Salary", hue = "Degree", kind = "point", data = df)
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = X)
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = Y)
sns.catplot(x="collegeGPA", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data=df)
sns.catplot(x="Degree", y="Salary", hue="Gender", markers=["^", "o"], linestyles=["-", "--"], kind="point", data = df)
Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.
They are particularly adept at showing interactions: how the relationship between levels of one categorical variable changes across levels of a second categorical variable. The lines that join each point from the same hue level allow interactions to be judged by differences in slope, which is easier for the eyes than comparing the heights of several groups of points or bars.
It is important to keep in mind that a point plot shows only the mean (or other estimator) value, but in many cases it may be more informative to show the distribution of values at each level of the categorical variables. In that case, other approaches such as a box or violin plot may be more appropriate.
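The caveat about the estimator can be illustrated with a small sketch. The data below are hypothetical and chosen so that both groups share a mean while their spreads differ.

```python
import pandas as pd

# Hypothetical salaries: same mean per gender, very different spread.
toy = pd.DataFrame({
    "Gender": ["m", "m", "m", "f", "f", "f"],
    "Salary": [300000, 300000, 300000, 100000, 300000, 500000],
})

by_gender = toy.groupby("Gender")["Salary"]
means = by_gender.mean()    # what each point in a point plot marks by default
spreads = by_gender.std()   # what the point alone does not show

print(means)
print(spreads)
```

Both points would sit at the same height in a point plot of this frame, while a box or violin plot would show that one group is far more spread out.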
sns.violinplot(x = df.CollegeTier, y = df.collegeGPA)
sns.violinplot(x = df.Degree, y = df.collegeGPA)
sns.catplot(x= "Degree", y= "collegeGPA", hue= "Gender", kind= "violin", inner= "stick", split=True, palette="pastel", data= df)
sns.catplot(x = "Degree", y = "collegeGPA", hue = "Gender", kind = "violin", split = True, data = df)
sns.catplot(x = "collegeGPA", y = "Degree", hue = "Gender", kind = "violin", bw = .15, cut = 0, data = df)
sns.catplot(x = "Degree", y = "collegeGPA", hue = "Gender", kind = "violin", data = df)
The white dot represents the median, the thick gray bar in the center represents the interquartile range, the thin gray line represents the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.
Violin plots are drawn vertically most of the time. If the labels are long, building a horizontal version (as in Output[22] above) makes them more readable.
If the variables are grouped, a grouped violin plot can be built in the same way as a grouped boxplot.
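The median and interquartile range that the inner box of each grouped violin encodes can also be tabulated directly. This is a sketch on hypothetical values; the `Degree` labels and GPA numbers are invented, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical GPAs for two degrees and two genders (three rows per group).
toy = pd.DataFrame({
    "Degree": ["B.Tech"] * 6 + ["M.Tech"] * 6,
    "Gender": (["m"] * 3 + ["f"] * 3) * 2,
    "collegeGPA": [60, 70, 80, 65, 75, 85, 70, 72, 74, 68, 78, 88],
})

# One row per (Degree, Gender) group, one column per quartile.
quartiles = (
    toy.groupby(["Degree", "Gender"])["collegeGPA"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles["IQR"] = quartiles[0.75] - quartiles[0.25]

print(quartiles)
```

Each row of `quartiles` corresponds to one violin (or one half of a split violin) in the grouped plot.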
sns.catplot(x = "Degree", y = "Salary", kind = "boxen", data = df.sort_values("collegeGPA"))
sns.catplot(x = "Degree", y = "Salary", hue = "Gender", kind = "box", data = df)
sns.boxplot(data = X, x='Specialization', y='Salary')
sns.boxplot(data = df, x = 'Specialization', y = 'Salary')
sns.boxplot(data = df, x = 'Designation', y = 'Salary')
sns.boxplot(data = df, x = 'GraduationYear', y = 'Salary')
sns.boxplot(data = df, x = 'CollegeTier', y = 'Salary')
sns.boxplot(data = df, x = 'CollegeTier', y = 'Degree')
sns.boxplot(data = X, x = 'collegeGPA', y = 'Designation')
sns.boxplot(data = df, x = 'Degree', y = 'Salary')
sns.boxplot(data = X, x = 'Salary', y = 'Designation', hue = 'Specialization')
sns.boxplot(data = Y, x = 'Specialization', y = 'Salary', hue = 'Designation')
sns.catplot(x = "English", y = "Logical", kind = "boxen", data = df.sort_values("Gender"))
plt.boxplot(df['Salary'])
plt.show()
plt.boxplot(df['collegeGPA'])
plt.show()
OUTLIERS: An outlier is an observation that is numerically distant from the rest of the data. Box plots are useful as they show outliers within a data set. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.
Box plots allow comparison of the medians, the interquartile ranges and the whiskers across groups.
They also reveal potential outliers and signs of skewness.
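A minimal sketch of how points outside the whiskers can be found numerically, assuming the common 1.5 × IQR rule (the default whisker length in Matplotlib's `boxplot`). The salary values are hypothetical.

```python
import pandas as pd

# Hypothetical salaries (in thousands) with one very large value.
salary = pd.Series([200, 220, 240, 250, 260, 280, 300, 900], name="Salary")

q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salary[(salary < lower) | (salary > upper)]
skewness = salary.skew()  # positive when the long tail is on the high side

print(outliers.tolist())
print(skewness > 0)
```

Replacing `salary` with `df['Salary']` or `df['collegeGPA']` would list the points that the corresponding box plot draws beyond its whiskers.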
sns.catplot(x = "Gender", y = "Salary", order = ["m", "f"], data = df)
sns.catplot(x = "English", y = "Quant", hue = "Gender", kind = "swarm", data = df)
sns.catplot(x = "GraduationYear", y = "collegeGPA", data = df)
sns.catplot(x = "Degree", y = "Salary", hue = "Gender", kind = "swarm", data = df)
sns.catplot(x = "CollegeTier", y = "collegeGPA", kind = "swarm", data = df)
sns.swarmplot(data = df, x = 'Specialization', y = 'Salary')
sns.swarmplot(data = df, x = 'Domain', y = 'English')
sns.swarmplot(data = df, x = 'Salary', y = 'Domain')
A swarm plot can give a better representation of the distribution of observations, although it only works well for relatively small datasets.
The plot can be enlarged, and points can be separated by hue using the argument dodge = True (named split in older seaborn releases).
The legend can be placed to the right, and the y-axis limits adjusted to end at 0.
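Because swarm plots only work well on small datasets, one option is to plot a random sample instead of every row. The sketch below assumes a sample of 300 rows is enough; the frame is synthetic, with only the column names and the row count taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 3998  # same row count as the dataset; the values below are synthetic

toy = pd.DataFrame({
    "Degree": rng.choice(["B.Tech", "M.Tech"], size=n_rows),  # hypothetical labels
    "Gender": rng.choice(["m", "f"], size=n_rows),
    "Salary": rng.integers(100000, 1000000, size=n_rows),
})

# A swarm plot places every point without overlap, so thin the data first.
sample = toy.sample(n=300, random_state=0)

print(len(toy), len(sample))
```

The sample can then be drawn with `sns.swarmplot(data=sample, x="Degree", y="Salary", hue="Gender", dodge=True)`, where `dodge=True` is the current seaborn name for separating the hue levels.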
sns.jointplot(x = 'Salary', y = 'collegeGPA', data = df, kind = 'scatter')
sns.jointplot(x = 'collegeGPA', y = 'Salary', data = df, kind = 'scatter')
sns.jointplot(x = 'English', y = 'Logical', data = df, kind = 'scatter')
sns.jointplot(x = 'English', y = 'collegeGPA', data = df, kind = 'scatter')
sns.jointplot(x='Logical', y='Quant', data=df, kind = 'scatter')
plt.scatter(df['Salary'], df['collegeGPA'])
plt.show()
plt.scatter(df['Salary'], df['Specialization'])
plt.show()
plt.scatter(df['Gender'], df['Salary'])
plt.show()
plt.scatter(df['English'], df['Domain'])
plt.show()
plt.scatter(df['Quant'], df['Logical'])
plt.show()
plt.scatter(df['Degree'], df['CollegeState'])
plt.show()
plt.scatter(df['Degree'], df['CollegeTier'])
plt.show()
plt.scatter(df['collegeGPA'], df['Gender'])
plt.show()
A scatter plot needs pairs of numerical values.
The dependent variable can take several values for each value of the independent variable.
It helps judge whether two variables are related, but it shows only correlation.
Discrete data suits pass/fail measurements; continuous data can take any value on an infinite scale, allows finer measurement, and is what scatter analysis generally uses.
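To put a number on the relationship a scatter plot suggests, the Pearson correlation coefficient can be computed. The sketch below uses hypothetical columns `x`, `linear` and `curved` to show that the coefficient only measures linear association.

```python
import pandas as pd

# Hypothetical data: one column depends on x linearly, the other does not.
toy = pd.DataFrame({"x": [-2, -1, 0, 1, 2]})
toy["linear"] = 2 * toy["x"] + 1   # straight-line dependence on x
toy["curved"] = toy["x"] ** 2      # depends on x, but not linearly

r_linear = toy["x"].corr(toy["linear"])  # Pearson correlation by default
r_curved = toy["x"].corr(toy["curved"])

print(round(r_linear, 6))
print(round(r_curved, 6))
```

The first coefficient is 1 and the second is 0, even though `curved` is fully determined by `x`, so a coefficient near zero is worth checking against the scatter plot itself.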
From the output, you can see that a joint plot has three parts.
A distribution plot at the top for the column on the x-axis, a distribution plot on the right for the column on the y-axis and a scatter plot in between that shows the mutual distribution of data for both the columns.
No correlation is visible between the x and y variables passed as input.
You can change the type of the joint plot by passing a value for the kind parameter.
sns.jointplot(x = 'collegeGPA', y = 'Salary', data = df, kind = 'hex', color = 'k')
sns.jointplot(x = 'English', y = 'Quant', data = df, kind = 'hex', color = 'b')
sns.jointplot(x = 'CollegeTier', y = 'Domain', data = X, kind = 'hex', color = 'b')
sns.jointplot(x = 'Quant', y = 'Logical', data = df, kind = 'hex', color = 'b')
sns.jointplot(x = 'CollegeID', y = 'Salary', data = df, kind = 'hex', color = 'b')
Instead of letting the points overlap, the plotting window is split into several hexbins, and the number of points per hexbin is counted.
The color denotes this number of points.
The size of the bins can be changed using the gridsize argument; when the size of the hexagons changes, the scale of the color bar guide is redefined accordingly.
We get a clear picture of density, distributions, and relative ranges, similar to a heat map.
The shape of the hexagon allows us to limit the effects of edge biases found in square bins, while retaining the ability to form a continuous grid.
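The effect of `gridsize` on the counts can be sketched with Matplotlib's `hexbin`, which does the binning behind a hex plot. The data are synthetic and the two grid sizes (5 and 25) are arbitrary values chosen for contrast.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(500)   # synthetic stand-ins for two score columns
y = rng.standard_normal(500)

fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(8, 4))
coarse = ax_coarse.hexbin(x, y, gridsize=5)   # few, large hexagons
fine = ax_fine.hexbin(x, y, gridsize=25)      # many, small hexagons
fig.colorbar(coarse, ax=ax_coarse)
fig.colorbar(fine, ax=ax_fine)

coarse_counts = coarse.get_array()   # one count per hexagon
fine_counts = fine.get_array()

print(len(coarse_counts), coarse_counts.max())
print(len(fine_counts), fine_counts.max())
```

Larger hexagons collect more points each, so the color bar of the coarse panel reaches a higher maximum than that of the fine panel.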
sns.pairplot(df)